PB93-238186 


information  is  our  business. 


EXPRESSING  BOOLEAN  CUBE  MATRIX  ALGORITHMS 
IN  SHARED  MEMORY  PRIMITIVES 


THINKING  MACHINES  CORP. 
CAMBRIDGE,  MA 


1993 


»isTi®SKm6«j  a  ‘ 


PsalMDed  j 


National  Technical  Information  Service 


dtic  quality 


■OfLPECTED  1 


Matrix  Multipllcatlon^oolean  Cubes'tJsUj^Shared'li^ory  Primitives 

C.I  Ho  and  S.L  Johnsson,  Yaletrnivefsity 

/Bco/e^r, 

t /Y  S/t/'9-.^W  m  o fZ.Y 

P/5!>;t  J  ve-c, 


Thinking  Machines  Corporation 
Technical  Report  Series 


Reproduced  by: 

National  Technical  Information  Service 
U.  S.  Department  of  Commerce 
Springfield,  VA  22161 


DP88-1 


REPORs  DOCUMEMTAT^OM  PAGE 


colJeajon  of  mformation,  .nciuding  suggestions  for  reducing  this  burden  to  Washmcton  Headeuar-pr? ^  bu.den  esvimate  or  any  otr.er  aspect  of  this 
Davis  H,gh„,y,5u,ro  1204,  Arlington.  22202^302.  and  tl  t4  S«f<l 


.  PB93-238186  _ 

for  reviewing  instruaions.  searching  existing  data  sources 

.  rKie  * _ _ _ *  x  .  ' 


1.  AGENCY  USE  ONLY  (Leave  blank) 

!  2.  REPORT  DATE  « 

4.  TITLE  AND  SUBTITLE 

1  1988  e 

S  Tnog  S  5.  report  jyP'  AUcD  DATES  COVERED 

f  19oo  s  Technical 

4.  TITLE  AND  SUBTITLE  ^ - rs.  FUMDIMG  NC.VBERS. - 

^  SONR  N00014-86-K-056i| 

Expressing  boolean  cube  matrix  algorithms  in  :; 

shared  memory  primitives  i 


6.  AUTHOR{S)  — ■ — - — 

S.L.  Johnsson  and  C.  Ho  '■ 


7.  PERFORMING  ORGANIZATION  NAME(5}  AND  ADDR£SS(ES) 

Thinking  Machines  Corp. 

245  First  Street. 

Cambridge,  MA  02142-1264 


8.  PERFOSMfNG  ORGANI2ATIO?< 
REPORT  JeUMBES 


■TMC-  6l 


9.  SPONSORING /MONITORING  AGENCY  NAME(S)  AND  ADDRESS(ES) 

0NR--  The  Pentagon 

Washington  DC  20350 


10.  SPONSORING /MONITORING 
AGENCY  REPORT  NUMBER 


11.  SUPPLEMENTARY  NOTES 


12a.  DISTRIBUTION /AVAILABILITY  STATEMENT 


2b.  DISTRiBUnCK  CODE 


13.  ABSTRACT  (Maximum  200  words)  '  '  '  

In  this  paper  the  focus  is  on  expressing  the  algorithms  in  shared 
memory  type  primitives.  We  assume  that  all  processors  share  the  same 
global  address  space,  and  present  communication  primitives  both  for 
nearest-neighbor  communication,  and  global  operations  such  as 
broadcasting  from  one  processor  to  a  set  of  processors,  the  reverse 
operation  of  plus-reduction,  and  matrix  transposition. 


14.  SUBJECT  TERMS 


Data  Parallel  Algorithms 


I  15.  NUMBER  OF  PAGES 

i _ 

16.  PRICE  CODE 


OF^REPCmT'^^^"''^™'^  r®’  I’®’  SECURITY  CLASSIFICATION  20.  LIMITATION  OF  ABSTRACT 

u»-  KtHOKF  OF  THIS  PAGE  OF  ABSTRACT  j  | 

Unclassified  SAR  SAR  |  SAR  | 

NSN  7540-01-280-5500  - ~ ^  ...  i 

Standard  Form  298  (Rev.  2-89) 
Prescribed  by  ANSI  Std.  Z39-18 
298-102 


Expressing  Boolean  Cube  Matrix  Algorithms  in  Shared  Memory  Primitives 


S.  Lennart  Johnsson^  and  Ching-Tien  Ho 
Department  of  Computer  Science 
Yale  University 
New  Haven,  CT  06520 
Johnsson@think.com,  Ho@cs.yale.edu 

DP88-1 


Abstract.  The  multiplication  of  (large)  matrices  allocated  evenly  on  Boolean  cube  config¬ 
ured  multiprocessors  poses  several  interesting  trade-offs  with  respect  to  communication  time, 
processor  utilization,  and  storage  requirement.  In  [7]  we  investigated  several  algorithms  for 
different  degrees  of  parallelization,  and  showed  how  the  choice  of  algorithm  with  respect  to  per¬ 
formance  depends  on  the  matrix  shape,  and  the  multiprocessor  parameters,  and  how  processors 
should  be  allocated  optimally  to  the  different  loops. 

In  this  paper  the  focus  is  on  expressing  the  algorithms  in  shared  memory  type  primitives. 
We  assume  that  all  processors  share  the  same  global  address  space,  and  present  communication 
primitives  both  for  nearest-neighbor  communication,  and  global  operations  such  as  broadcasting 
from  one  processor  to  a  set  of  processors,  the  reverse  operation  of  plus-reduction,  and  matrix 
transposition  (dimension  permutation).  We  consider  both  the  case  where  communication  is 
restricted  to  one  processor  port  at  a  time,  or  concurrent  communication  on  all  processor  ports. 
The  communication  algorithms  are  provably  optimal  within  a  factor  of  two.  We  describe  both 
constant  storage  algorithms,  and  algorithms  with  reduced  communication  time,  but  a  storage 
need  proportional  to  the  number  of  processors  and  the  matrix  sizes  (for  a  one-dimensional 
partitioning  of  the  matrices). 


1  Preliminaries 


Throughout  the  paper,  JV  =  2"  denotes  the  number  of  processors  of  an  n-dimensional  Boolean 
cube,  or  n-cube.  With  respect  to  algorithms  and  data  structures  we  factor  N  as  Ni  x  N2 
(2"i  X  2"2)  for  two-dimensional  partitionings  of  matrices,  or  as  x  N2  x  Nq  (2”5  x  2’*2  x  2’*3) 
for  three-dimensional  partitionings  (defined  later).  We  consider  the  matrix  operation  A  CxD, 
where  all  matrices  are  dense,  C  a.  PxQ  matrix,  D  aQxR  matrix,  and  A  a  P  x  .R  matrix. 
We  present  algorithms  for  different  initial  and  final  allocations  of  the  matrices:  one-dimensional 
(column  or  row),  two-dimensional  (block),  and  three-dimensional  partitionings.  The  two  initial 
matrices  C  and.jD  and  resulting  matrix  A  are  assumed  to  be  distributed  among  all  the  processors 
in  the  same  manner,  except  in  the  three-dimensional  case. 

A  matrix  element  is  assigned  to  only  one  processor  initially.  With  P,  Q,  and  R  being  powers 
of  two,  P  =  2P,  Q  =  2’,  and  P  =  2’’,  matrix  element  c,j  of  matrix  C,  0  <  t  <  P,  0  < 
3  ^  Q>  is  assigned  to  a  processor  as  shown  in  Table  1  for  one-  or  two-dimensional  partitionings, 
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Jnj— ijnj— 2  •  •  •Jo) 

Gray 

(^(*ni— 2  •  •  ‘^o)!! 
^(jn2— iJnj-Z  •  •  -  Jo)) 

Table  1:  Various  ways  of  assigning  matrix  elements  into  processors. 

consecutive  or  cyclic  storage  [6],  and  binary  or  Gray  code  encoding  [10,6].  In  the  two-dimensional 
partitioning,  each  column  (row)  is  assigned  to  Ni  {N2)  different  processors.  The  row  partitioning 
is  obtained  by  replacing  (j,g)  by  (i,p).  By  replacing  by  or  the 

processor  assignment  for  matrix  element  djk  (of  D)  or  aik  (of  A)  is  obtained. 

For  a  three-dimensional,  partitioning,  matrix  C  is  partitioned  as  JV3  block  columns  and  matrix 
D  is  partitioned  into  block  rows.  Each  block  column  of  C  and  each  block  row  of  D  are  further 
partitioned  into  N{  x blocks.  The  resulting  matrix  A  can  be  partitioned  into  N{  x blocks, 
or  as  N{N^  x  N2  blocks,  or  into  a  form  in-between  these  forms  with  the  same  communication 
complexity.  If  the  matrices  C  and  D  are  initially  partitioned  into  a  Ni  x  N2  processor  array,  then 
some  communications  are  required  to  rearrange  the  data  allocation  for  a  matrix  multiplication 
in  which  all  three  nested  loops  in  a  matrix  multiplication  algorithm  (expressed  in  a  conventional 
language)  are  parallelized.  This  communication  has  a  data  communication  time  that  is  of  lower 
order  than  the  data  communication  for  the  matrix  multiplication,  except  if  there  axe  very  few 
elements  per  processor. 

In  the  following,  all  algorithms  are  described  in  a  Crystal-like  notation  [2].  Each  instruction  is 
defined  as  a  function.  By  interpreting  the  first  one,  two,  or  three  parameters  as  processor  identi* 
fier(s)  in  the  one-,  two-,  or  three-dimensional  partitioning  cases,  parallel  codes  for  the  algorithms 
are  obtained.  The  communications  are  specified  assuming  a  global  address  space.  The  processor 
indices  are  part  of  the  global  address.  For  a  naive  implementation  of  the  communications,  for 
instance  by  using  a  noncombining  router,  and  without  using  multiple  paths  between  pairs  of 
processors,  efficiency  may  be  lost  due  to  poor  scheduling  (collisions),  or  poor  path  selection  (non¬ 
minimum  path  lengths,  single  paths).  We  expand  the  communication  primitives  (specification) 
into  a  sequence  of  nearest-neighbor  communications,  also  described  in  the  Crystal-like  notation. 
Execution  of  the  communication  code  replaces  the  high-level  communication  specification.  The 
communication  primitives  we  use  are  olUto^oll  hToodcosting  on  a  (sub)cube,  dlUto^Qll  reduction 
(in  a  divide- and-conquer  manner)  [9],  and  matrix  transposition  (dimension  permutation). 
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In  the  Crystal-like  codes  each  function  of  I  parameters  may  be  optionally  followed  by  an 
^pression  “  over  domaimx  domains  x  •••x  domain/”,  where  domain,-  is  the  domain  of  the 
tth  parameter.  [a:,j/],  y  >  x,  denotes  the  set  of  integers  {x,®  -I- 1,. .  and  [x,i/)  denotes 
—  l}.  The  statements  enclosed  between  and  ^  form  a  conditional  statement. 

For  example, 

cond\  — »  resulti, 
conds  —*■  Tesult2, 
else  — »  results 


reads  as  “if  condj  then  resulti,  else  if  conds,  then  results,  else  results”.  \f  [/(i)*5(i)|0  <  i  <  ®] 
denotes  5:j-^(/(j)  ♦p(j)).  We  use  c(i,i),  0  <  i  <  P,  0  <  j  <  g,  to  denote  the  matrix  element 
at  the  tth  row  and  jth  column  of  C.  d{j,k)  and  a{i,k)  are  similarly  defined.  For  matrices 
distributed  over  a  set  of  processors,  in  our  case  a  Boolean  cube,  it  is  more  convenient  to  identify 
a  matrix  element  by  processor  address,  and  the  relative  indices  of  the  local  submatrix.  Ic,  Id 
and  la  are  used  to  denote  the  local  submatrices  of  C,  D,  and  A,  respectively. 


We  use  Q  and  d  to  distin^sh  between  binary  encoding  and  Gray  code  encoding  of  the 

processor  id  (ptd),  i.e.,  a  =  pid  and  d  =  G{pid),  where  G  is  the  binary-reflected  Gray  code 
encoding  function. 


For  the  analysis  we  denote  the  communication  packet  size  by  B,  the  communication  start-up 
time  with  r,  the  time  for  the  transmission  of  an  element  by  tg,  and  the  time  for  an  arithmetic 
operation  by  For  the  communication  system  we  consider  one-port  communication,  for  which 
communication  only  can  take  place  on  one  port  at  a  time,  and  n-port  communication,  for  which 
all  ports  on  each  processor  can  be  used  concurrently. 


2  Communication  primitives 

The  communication  routines  we  use  for  matrix  multiplication  on  the  Boolean  cube  are  all-to- 
all  broadcasting,  all-to-all  reduction  and  matrix  transposition.  All-to-all  broadcasting  and  the 
reversed  operation  all-to-aB  reduction  are  described  in  detail  in  [9,11].  Matrix  transposition 
With  one-dimensional  partitioning  has  the  same  communication  pattern  as  all-to-all  personal- 
ized  communicatioi^],  ^  known  as  complete  exchange  [llj.  With  a  two-dimensional  square 
partitioning  into  VNxyN  blocks,  optimal  algorithms  are  described  in  [8,11].  For  the  transpo¬ 
sition  of  a  matrix  partitioned  into  NiX  Ns  blocks,  one  can  combine  the  one-dimensional  matrix 
transposition  algorithm  with  the  dgorithm  for  the  transposition  of  a  two-dimensionally  square- 
partitioned  matrix.  The  communication  complexities  of  various  algorithms  are  summarized  in 
11  ^  2,  3  and  4.  Note  that  the  complexity  of  the  all-to-all  reduction  is  the  same  as  that  of 

all-to-all  broadcasting,  if  the  number  of  elements  per  processor  before  the  reduction  is  the  same 
as  the  number  of  elements  per  processor  after  the  broadcasting. 
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Model 

Algorithm 

min  start-ups 

one-port 

SBT 

(N  -  1)M 

n 

n-port 

nRSBT 

jlllHBuamafiiiiiiifi 

n 

Table  2:  Communication  complexity  of  all-to-all  broadcasting  on  an  n^cube  with  M  elements 
per  processor  initially. 
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(iV-i)rt/ 
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n 

Table  3:  Communication  complexity  of  all-to-all  reduction  on  an  ri-cube  with  M  elements  per 
processor  initially. 
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one-port 

SBT 

riM 
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n 

Table  4:  Communication  complexity  of  all-to-all  personalized  communication  with  M  elements 
per  processor  initially. 
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2.1  One-dimensional  Matrix  Partitionings 

The  code  for  all-to-all  broadcasting  based  on  N  translated  Spanning  Binomial  Trees  (SBT’s)  [9] 
with  one-port  communication  is  described  below. 

/*  SBT  broadcasting.  */ 

/*  Row  direction,  one-port,  binary  enc.  */ 
lcJ>rdl(a,  t,/,  t)  over  [0  :  TV)  x  [0  :  P)  x  [0  :  2‘^)  X  [0  :  n]  = 

else  — *■ 

<  0  <  i'  <  ^  IcJbrdlia,  i,j\t  -  1), 

/*  Get  from  (t  —  l)th  nbr  and  append.  */ 
else  /c^rdl(a0  2*“^,t,/-  2‘~^S,t-  1)  », 

/*  Order  the  N  blocks  by  pid.  */ 

lcJtrd(Q,  i,j)  over  [0  :  x  [0  :  P)  x  [0  :  Q)  =  lcJbrdl(a,  i,j  ©  n) 


For  Gray  code  encoding,  a  is  replaced  by  d  and  lcJ>rd  is  redefined  as: 


/*  Order  the  N  blocks  by  G{pid).  *j 

lcJ>rd{a,  i,j)  over  [0  :  JV)  x  (0  :  P)  x  [0  :  Q)  =  lcJ,rdl{a,  i,  (G(  L^J )  ©  d)g  +  j  mod  n) 


With  n-port  communication,  all-to-all  broadcasting  based  on  N  distinct  translations  of  n 
Rotated  Spanning  Binomial  Trees  (nRSBT),  Spanning  Balanced  n-Trees  (SBnT)  and  n  Edge- 
disjoint  Spanning  Binomial  Trees  (nESBT)  [9]  are  all  optimal  within  constant  factors.  The 
algorithm  for  nRSBT  broadcasting  is: 

/*  nRSBT  broadcasting.  */ 

/*  Row  direction,  n-port,  binary  enc.  */ 

lcJ>rdl{a,  u, i',f,  t)  over  [0  :  iV)  x  [0  :  n)  x  [0  :  f )  x  [0  :  2*S)  x  [0  :  n]  = 

<  t  =  0  lc(a,  ttf  +  *',/), 
else  — ♦ 

<  0  <  /  <  2‘“» ^  lcJ>rdlia,  u,  i\  j',  t  -  1), 

else  lcJ>rdl(a  ©  u,  i',f  -  -  l)  », 

lcJ)rd{a,  i,j)  over  [0  :  JV)  x  [0  :  P)  x  [0  :  Q)  = 

lc-brdl(a,  ,  i  mod  (sh{  )Qa)^+j  mod  n) 


With  one-port  communzcotion,  the  code  is  described  below.  The  code  for  n-port  communi¬ 
cation  is  included  in  appendix  A. 
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/*  SBT  transpose.  */ 

/*  Column  partitioning,  one«port,  binary  encoding.  * / 
/cJxpl(a,t',i,t)  over  (0  :  jV)  x  (0  :  J)  x  [0  : 2‘g)  x  [0  :  n]  = 

<  t  =  0 lc(a,i'J), 

mod  2  =  0-+ 

<  0  <  j  <  2*-ig  -+  /cJxpl(Q,  -  1) 

else  -+  /cJxpl(a  ©  2’*“*,  i',j  -  2*“^ S,  t  -  1)  >, 

else 

<  0  <  i  <  2‘-i  g  -+  /cJxpl(Q  ©  2’*-*,  i'  +  j,  t  -  1) 
else  -+/cJxpl(a,t'  + 

/cJxp(a,  i',i)  over  [0  :  JV)  x  [0  :  j^)  X  [0  :  Q)  =  /cJxpl(a,  i\j,n) 


^Vitll  on6“port  coTntnunicdtiony  the  code  is  described  below.  The  code  for  n-povt  communi¬ 
cation  is  included  in  appendix  A. 

/*  SBT  reduction.  */ 

/*  Between  columns,  one-port,  binary  encoding.  */ 
lajredl(a,  i,  k',  t)  over  [0  :  iV)  x  [0  :  P)  x  [0  :  ^)  X  [0  :  n]  = 

<  t  =  0 /c(a,i,fcO, 

2  =  0  — ►  /a_redl(a,  t,  k',  t  —  1)  -1-  lajredl{a  ©  2"“*,  i,  k',  t  —  1), 
else  —  lajred\{a, i,  A:' -I-  t  -  1)  +  lajredl{a  ©  2"-',  i, k'  +  ^, t  -  1)  >, 

lajred{a,i,  over  [0  :  JV)  x  [0  :  P)  x  [0  :  ^)  =  lajredl{a,  i,  k',  n) 


2.2  Two-dimensional  Matrix  Partitionings 

All-to-all  broadcasting  based  on  the  SBT  and  nRSBT  routings  within  a  column  or  row  subcube 
are  the  same  as  in  the  one-dimensional  case,  see  appendix  A. 

Transposing  3.  P  x  Q  matrix  partitioned  into  N\  x  N2  blocks,  implies  that  the  processor 
that  holds  block  (t,j),  0  <  t  <  Ni,  0  <  j  <  JV2,  will  hold  block  (j,i)  after  the  transposition, 
^r  co^enience,  we  assume  that  the  shape  of  the  submatrix  defined  by  a  block  changes  from 
^  ^  ^  ^  ^  (instead  of  changing  to  a  ^  x  submatrix).  The  transposition  can 

be  decomposed  into  two  phases.  In  the  first  phase,  there  are  2l"2-"il  subcubes,  such  that  <»ach 
subcube  executes  a  transposition  of  min(JVi,  A^2)xmin(iVi,  N^)  blocks.  In  the  second  phase,  there 
^  ^  subcubes,  such  that  each  subcube  executes  a  one-dimensional  transposition. 

The  communication  complexity  is  derived  in  [7].  The  code  for  one-port  communication  is  given 
below. 


/*  SBT  tranpose  alg.  with  Ni  x  N2  partitioning.  * / 
f(t)  over  [0  :  n]  = 

<  t  <  2  min(ni,  n2)  -+  1, 

m  >  n2 
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else  2^ >, 

02, *',/,<)  over  [0  :  Ni)  x  [0 :  N2)  x  [0  :  x  [0  :  jj^)  x  [0  :  n]  s= 

<  t  =  0  -»■  /c(oi,  02,  *',/)» 
t  <  2min(ni,n2)  —* 

/*  two-dimensional  transpose.  */ 

^  l-2"i-i‘/*i  J  2  =  [yjjrfiTJrJ  2 

<  t  mod  2  =  1  /cJarpl(ai  ©  2”>-r*/2l^a2^ ,•/ y/ <  _ 
else  — ♦ 


<  t  mod  2  =  0  /cJa:pl(ai, 02  ®  2"*“f*/2l ,  i\f, t  -  1)  », 

/*  one-dimensional  transpose.  */ 

Til  <  712  — ► 

<  pgiq  mod  2  =s  0  — » 

<  0  <  /  <  2*'-^^  /cJxpl(oi, 02,  i',/,  t  -  1), 

else  /cJxpl(oi,02©2»-<,t',/-  2<'-i^,t-  1)  >, 
else  -*■ 

<0<r<  2‘'-ij|  ^  /cJxpl(0x,02  ©  _  1), 

else  /cJxpl(oi,02,t'-|-  5;;^,/-  -  1)  », 

else  —* 

^  2  =  0  — ► 

<  0  <  i'  <  2‘'-*^  -» /cJxpl(oi, 02, *■',/,<  -  1), 

else  —  /cJxpl(oi  0  2"-‘,  02,  i'  -  -  1)  >, 

else  -♦ 

<  0  <  t'  <  2‘'-^^  -f  /cJxpl(oi  ®2"-*,Q2,i',j'-|-  1), 

else  -  /cJxpl(oi,02,t'-2*'-i3^,/-|-  1)  >» 

where  t'  =  t -2  min(ni  ,712), 

/cJxp(oi,02,i',/)  over  [0  :  N-^)  x  [0  :  IVj)  x  [0  :  5^)  x  [0  :  ^)  =  /cJxpl(oi,02,z',/,n) 


With  n-port  communication,  one  can  either  run  the  n-port  version  for  the  two  phases  sepa¬ 
rately,  or  pipeline  the  two  phases.  However,  by  treating  the  transposition  as  a  stable  dimension 
permutation  [4,5]  and  employing  one  of  those  algorithms  a  communication  complexity  lower 
than  that  of  the  above  algorithm  can  be  obtained.  The  dimension  permutation  algorithm  is 
based  on  the  fact  that  the  tw’o  phases  can  be  reversed,  or  mixed,  preserving  the  permutation. 

3  Matrix  multiplication 

3.1  One-dimensional  partitioning 

We  consider  column  partitioning.  For  row  partitioning,  similar  algorithms  can  be  derived. 

•  Algorithm  Compute  A  in-place  by  broadcasting  of  C  from  every  processor 

that  has  elements  of  C  to  every  processor  that  has  elements  of  D.  Processor  a  =  PID{j) 
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Column  Partitioning: 

Ai;  1, 1) :  C.«,  7  C..,  D 


A-,  1,3) 


nipy.,  [R]  . 

•Of  -<  Am 


Cmay^moi 


txp.  C,  y 


r  n  n  n  i  ‘*P-  A.  . 

- -  Oa,.,X/.«  ==1^  =;  i>  Am 


n  txp.  D,  _  mpy.,  [Q|  red.  A,  <-►  . 

—  i'  C7.a,  i?a,  === ■■"■  V  .4?,  !>  A, 


Row  Partitioning: 

•4(*,  1,  1)  :  CamiDgm 

A-,  1,3) 

A-,1,4) 


A, 


n  ifP:  n  [Q1  a*  i  . 


Figure  1:  Notation  summary  of  algorithms  for  one-dimensional  partitioning, 

computes  CZ?(*,  Lj^J)  lor  all  j  mapped  to  a,  where  PJD  is  the  allocation  function  as 
described  in  Table  1. 

*  ^Isorithin  «4(')ly2).  Compute  A  by  a  transpose  of  C  and  broodcosting  of  C"^  from  every 
processor  that  has  elements  of  to  every  processor  that  has  elements  of  D.  Processor 
a  =  PID{j)  computes  CD{*,  Lj^J)  all  j  mapped  to  a. 

*  Algorithm  A(*)l)3).  Compute  A  by  a  transpose  of  C,  broodcosting  of  D  from  every 

processor  that  has  elements  of  D  to  every  processor  that  has  elements  of  C^,  and  transpose 

A^.  Processor  a  —  PID(j)  computes  C’([-jr^J,*)jD. 

1^1 

*  Algorithm  A(-,l,4).  Compute  A  in-space  by  a  transpose  of  D,  and  reduction  of  partial 
inner  products  of  A. 


The  algorithms  are  identified  by  A{number  of  ports  used  concurrently,  number  of  loops  par¬ 
allelized,  algorithm  identifier).  Algorithm  >l(-,l,2)  is  clearly  inferior  to  algorithm  ^(-,1,1)  with 
respect  to  commumcation  complexity,  and  is  not  further  considered  for  the  one-dimensional 
partitioning,  but  will  be  considered  for  the  two-dimensional  partitioning.  For  row  partitioning 
the  roles  of  C  and  D  are  interchanged.  Figure  1  characterizes  the  basic  algorithms,  the  corre¬ 
sponding  algorithms  for  row  partitioning  is  also  included  for  comparison.  The  two  subscripts 
denote  the  ordinal  numbers  of  block  rows  and  block  columns.  The  superscript  denotes  the  ordi¬ 
nal  number  of  the  partial  inner  product  result.  The  number  in  the  square  brackets  (eg.  [R]  in 
A(*,l,l))  is  the  number  of  processors  that  minimizes  the  arithmetic  time  for  each  algorithm. 

A  complete  matrix  multiplication  algorithm  based  on  rotations  of  the  matrix  C  is  given 
below: 

/*  A  Rotation  Algorithm  A(l,l,l).  */ 

I*  Colunm  partitioning.  Gray  code  encoding.  */ 
lc(a,i,j\t)  over  [0  :  IV)  x  [0  :  P)  x  [0  :  ^)  x  [0  :  JV)  = 
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<«  =  0-^c(i,d^  +  /), 

else  -♦  /c((d  +  1)  mod  Ny  iyfy  <-!)>, 
ld(&,j,k')  over  [0  :  ^■)  x  [0  :  Q)  x  [0  :  §)  -  d(i,dj^  +  k'), 

/c(d,  i,  k\  t)  over  [0  :  JV)  x  [0  :  P)  x  [0  :  x  [0  :  JV]  = 

<  t  =  0  0, 

else  la(a,  t,  A:',  t  -  1)  +  (\+  [/c(d,  t,  f,  <  -  1) 

*  /d(d,  ((d  + 1  -  1)  mod  JV)g  +  f,  fc')|0  <  f  <  §])  >, 
a(i,  k)  over  [0  :  P)  x  [0  :  P)  =  la(  ,  i,  k  mod  §,  N) 


A  naive  implementation  of  the  above  code  may  use  more  storage  than  necessary.  For  instance, 
each  processor  needs  to  store  all  the  N  column  blocks  of  C.  However,  a  reasonable  compiler  can 
resolve  this  problem  by  deallocating  unused  space,  or  by  using  shared  variables. 

Note  that  the  rotation  operation  implies  nearest-neighbor  communication,  if  d  and  (d  -|- 
1)  mod  N  are  in  adjacent  processors.  Since  d  is  the  Gray  code  encoding  of  the  processor  id, 
i.e.,  the  jth  block  column  is  stored  in  processor  pid  with  d  =  G(pid)  =  j,  rotation  of  C  implies 
nearest-neighbor  communications.  For  binary  encoding,  i.e.,  the  jth  block  column  is  stored  in 
processor  a  =  jf,  we  redefine  Ic  and  la  as  follows: 

/*  G(t)  is  the  binary-reflected  Gray  code  oft.  */ 

/*  G~^  is  the  inverse  function  of  G.  */ 

G-^(t)  =<  t  =  0  0, 

else  teG'-^CLjJ)  >, 

/c(q:,  i,  f,  t)  over  [0  :  x  [0  :  P)  x  [0  :  g)  x  [0  :  A")  = 

<  t  =  0~.c(t,a^ -»-/), 

else  lc(G~^((G{a)  +  1)  mod  1)  >, 

la{a,  i,  k\  i)  over  (0  :  A^  x  [0  :  P)  x  [0  :  ^)  x  [0  :  iV]  = 

<  t  =  0  -V  0, 

else  -f  la(a,  i,  k',  t  -  1)  -|-  (\-|-  [/c(q,  i,f,  t-1) 

*  ld{a,  G((G~\a)  +  t-l)  mod  +  j',  A:')l0  <  f  <  ^])  > 


Instead  of  all-to-all  broadcasting  through  rotations  a  Gray  code  exchange  algorithm  can  be 
used: 

/*  A  Gray  code  Exchange  alg.  A(l,l,l).  */ 

/*  Column  partitioning,  binary  code  encoding.  */ 

/*  T{t)  is  the  index  of  the  tth  transition  bit  in  the  Gray  code 
on  n  bits  =  the  number  of  trailing  I’s.  */ 

T{t)  =<  t  mod  2  =  0  — ►  0, 
else  + 

/c(a,  i,f,  t)  over  [0  :  AT)  x  [0  :  P)  x  [0  :  g)  x  [0  :  AT)  = 
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else  —*■  lc{a  © 

ld(aj,  k')  over  [0  :  JV)  x  [0  :  Q)  x  [0  :  ^)  =  d(j,  a§  +  ifc'), 
la(a,  i, k\t)  over  [0  :  iV)  x  [0  :  P)  x  [0  :  X  (0  ;  jV]  = 

<<  =  0-^0, 

else  la{a,  i,  k\  t  -  1)  +  (\+  [lc(a,  i,  f,  <  -  1) 

*  /d(a,  (a  e  G(t  -  l))g  +  r,  *0|0  <  j'  <  ^])  >, 

a(i,  A)  over  [0  :  P)  x  [0  :  P)  =  la{  ,i,kmod§,N) 


For  Gray  code  encoding,  the  Gray  code  exchange  algorithm  can  also  be  used.  The  code  is 
similar  and  is  included  in  appendix  B.  Note  that  the  rotation  algorithm,  and  the  Gray  code 
exchange  algorithm  can  be  viewed  as  one-dimensional  versions  of  Cannon’s  [1]  and  Dekel’s 
[3]  algorithms,  respectively.  The  encodings  only  affect  which  N  block  rows  of  D  within  each 
processor  interact  with  the  current  block  column  of  C. 

In  the  case  communication  can  take  place  concurrently  on  all  the  ports  of  a  processor,  the 
data  set  for  the  matrix  C  is  partitioned  into  n  equal  pieces.  Each  such  piece  is  broadcast  through 
a  unique  path.  In  the  case  of  the  Gray  code  exchange  algorithm  the  paths  are  obtained  through 
rotation  of  the  dimensions,  such  that  if  the  edges  in  dimension  T{t)  are  used  by  path  0  during 
step  t,  then  path  u  uses  the  edges  in  dimension  (T(t)  -}-  u)  mod  n  during  the  same  step. 


/*  A  Gray  code  Exchange  alg.  A(n,l,l).  */ 

/*  Column  partitioning,  binary  code  encoding.  */ 

/c(q, tt, over  [0  :  W")  X  [0  :  n)  x  [0  :  f )  x  [0  :  ^)  X  [0  :  iNT)  = 

else  -*  /c(a  ®  u,  t  -  1)  >, 

ld(aj,k')  over  [0  :  iV)  x  (0  :  Q)  x  [0  ;  j^)  =  d{j,a§  +  Jfc'), 

/*  Shuffle  (cyclic  left-shift)  u  steps  of  t.  */ 
sh(u,  t)  over  [0  :  n)  x  [0  :  iV)  =  (t  mod  2’*-“)2“  + 
la(a,  u,  t',  k',  t)  over  [0  :  N")  X  [0  :  n)  x  [0  :  f )  x  [0  :  ^)  X  [0  :  iV]  = 
<  t  =  0  -*•  0, 


else 


la{a,  u, i', k', t  -  1)  -f-  (\-|.  [/c(o, «,  i\f,  t  -  i) 

♦  ld(a,  (a  ©  {sh(u,  G{t  -  l))))g  +  /,  /;')|o  <  f  <  S])  >, 
a(t,  k)  over  [0  :  P)  x  [0  :  P)  =  la{  [!f\ ,  ,  i  mod  f ,  k  mod  N) 


Both  the  previous  algorithms  operate  with  constant  storage  requirements.  The  number  of 
cominunication  actions  is  linear  in  the  number  of  processors,  but  can  be  reduced,  if  there  exists 
sufficient  storage  to  employ  a  doubling  algorithm.  Note  that  by  using  a  high-level  specification 
for  the  communication,  the  code  below  is  independent  of  how  the  communication  is  realized,  and 
hence  independent  of  for  instance  network  topology,  and  low  level  communication  primitives. 

The  initial  allocation  of  C  and  P,  and  the  final  allocation  of  A  are  the  same  for  all  the 
algorithms  for  column  partitioning  that  we  consider.  The  allocations  are  shown  below,  and 
omitted  in  the  following. 
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/*  Initial  allocation  of  C  and  D.  */ 

/c(a,  *, iO  over  (0  :  iV)  x  [0  :  P)  x  [0  :  S)  =  c(i,  og  +  j'), 
ld(aj,  over  [0  :  JV)  x  [0  :  (?)  x  (0  :  j^)  =  d{j,  a§  +  k% 
/*  Final  location  of  matrix  A.  *f 
a(t,  over  [0  :  P)  x  [0  :  P)  =  la{  ,  i,  k  mod 


/*  A  Doubling  Algorithm  A(-,l,l):  */ 

IcJrd(a,i,j)  over  [0  :  JV)  x  [0  :  P)  x  [0  :  g)  =  mod  ^), 

Ia(a,  t,  *0  over  [0  :  JV)  x  [0  :  P)  x  [0  :  =  \+  [/c^rd(a,  tj)  *  ld(a,J,  Jt')|0  <J<Q] 


I*  Algorithm  A(-,l,3):  ♦/ 

/cJxp(a,t',i)  over  (0  :  INT)  x  [0  ;  x  [0  :  g)  =  +  i',j  mod  ^), 

ldJ)rd(a,j,  k)  over  [0  :  iV)  x  [0  :  g)  x  [0  :  P)  =  ld([^],j,k  mod  §), 

laJxp(a,  i',  k)  over  [0  :  JV)  x  [0  :  ^)  x  [0  :  P)  =  \+  [lcJ.xp{a,  i',  j)  *  IdJbrd^a,  j,  ^')|0  <j<  Q], 

la(a,  i,  *0  over  [0  :  iV)  x  [0  :  P)  x  [0  :  =  /aJxp([i^J ,  i  mod  a§  +  k') 


/*  Algorithm  A(*,l,4):  •/ 

ldJxp(aJ', k)  over  [0:N)x  [0  :  g)  x  [0  ;  P)  =  /d(L^J , +  j', k  mod  §), 

/a(Q,  i,  k)  over  [0  :  iV)  x  [0  :  P)  x  [0  :  P)  =  \+  [/c(q,  i,f)  *  ldJxp(a,f,  Jfc)|0  <  f  <  ^], 
la.red{a,  i,  k')  over  [0  :  JV)  x  [0  :  P)  x  (0  :  f )  =  \+  [/a(a',  i,  a§  +  Jk')|0  <  a' <  N] 


Table  5  shows  the  total  number  of  arithmetic  operations  in  sequence.  If  P,g,  and  P  all  are 
multiples  of  N,  then  all  three  algorithms  have  the  same  arithmetic  complexity.  For  P,Q,R>  N, 
the  differences  of  the  arithmetic  complexities  are  within  constant  factors.  Table  6  shows  the 
total  number  of  elements  transferred  in  sequence  and  the  minimum  number  of  start-ups  for 
^  ^ -  The  superscript  /  on  A  denotes  a  linear  array  algorithm,  and  superscript  c  a 
Boolean  cube  algorithm.  For  some  values  of  P,  g,  and  P  less  than  N,  the  communication 
complexity  can  be  smaller  than  what  is  given  in  the  table,  because  some  of  the  broadcastings 
and  personalized  communications  may  complete  earlier.  The  communication  complexity  for  the 
general  case  is  complicated  and  described  in  [7].  The  data  transfer  time  compares  as  PQ  :  QR  : 
PP,  approximately,  by  considering  the  highest-order  term  of  A(-,l,l),  A(-,l,3)  and  A(*,l,4)  and 
assuming  P,Q,R  >  N.  Note  that  for  ^  ^  the  communication  complexity  of  A(-,l,l) 

is  less  than  that  of  A(*,l,4),  which  in  turn  is  less  than  that  of  A(",l,3).  For  a  detailed  analysis, 
see  |7]. 


3.2  Two-dimeiisional  partitioning 

The  algorithms  described  for  the  one-dimensional  case  have  analogues  in  the  two  dimensional 
case.  Algorithm  A(",l,l)  that  computes  A  in-place  by  broadcasting  C  in  its  two-dimensional 


11 


Table  6:  ' 

P,Q,R> 


Algorithm 

Number  of  arithmetic  operations 

[gag™ 

lEBSSSI 

-i)+p(rsi+E?-,rsi) 

Table  5:  The  arithmetic  time  for  one-dimensional  column  partitioning. 


jAlgorithm 

Element  transfers  min  start-uns 

mSBSSSM 

■HDBi&liilBHi 

N-l 

(AT-DPr^^i 

n 

Zn 

Mil  HIM  1113^^ 

2n 

-4'(n,l,l) 

N-l 

n 

Sti 

‘4‘'(n,l,4) 

2n 

he  communication  complexity  using  one-dimensional  column  partitioning,  assuming 
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form  requires  broadcasting  of  dements  of  C  along  rows  and  broadcasting  of  elements  of  D  along 
columns.  The  two  broadcasting  operations  need  to  be  synchronized  in  order  to  conserve  storage. 
Cannon  [1]  has  described  such  an  algorithm  for  mesh  configured  multiprocessors  (that  can  be 
emulated  on  Boolean  cubes)  and  Dekel  et  al.  [3]  described  such  an  algorithm  making  use  of 
the  Boolean  cube  topology.  These  algorithms  are  special  cases  of  matrix  multiplication  using 
broadcasting  algorithms  that  preserve  storage  requirements. 

The  algorithms  corresponding  to  the  four  one-dimensional  algorithms  (>4(*,1,4)  has  two  vari¬ 
ations)  are 


•  Algorithm  A(*)2,l).  Compute  A  in-place  by  broadcasting  of  C  in  the  row  direction  and 
D  in  the  column  direction  such  that  each  processor  receives  all  elements  of  the  rows  of 
C  mapped  into  that  processor  row  and  all  elements  of  D  mapped  into  the  corresponding 
column  of  processors.  Processor  ai,02  then  computes  C7(['jr-p-rJ,*)i?(*,  L74^J)  for  all  i 

mapped  to  ai  and  all  j  mapped  to  aj.  The  communication  operations  are  broadcasting 
from  multiple  sources  within  rows  and  columns. 

Algorithm  ^(*,2^2).  Transpose  C,  perform  a  multiple  source  broadcast  along  processor 
rows  for  the  elements  of  in  that  processor  row,  and  accumulate  inner  products  for  A 
through  multiple  sink  reduction  in  the  column  direction  (of  the  processors).  The  accu¬ 
mulation  can  be  made  such  that  elements  for  each  column  of  D  are  accumulated  in 
each  processor  by  all-to-all  reduction.  A  processor  01,0:2  receives  C(+,  [j-^J)  during  the 

broadcasting  operation,  then  computes  the  product  C{*,  L )-C^(  j  The 

summation  over  index  i  is  the  reduction  operation  along  columns. 


•  Algorithm  A(*,2,3).  Transpose  C,  perform  multiple  source  broadcasting  of  the  elements 
of  D  within  processor  rows,  accumulate  inner  products  in  the  column  direction.  The 
multiple  sink  reduction  is  performed  such  that  each  processor  receives  all  elements  of 

distinct  columns  of  2?,  such  that  is  computed.  (Alternatively,  the  accumulation  can 
be  made  such  that  elements  for  each  column  are  accumulated  in  a  processor 

selected  such  that  the  proper  allocation  of  A  is  obtained  through  a  some-to-all  personalized 
communication  within  rows.)  Processor  01,02  computes  C’(Lj^J,  L|^J)X>([y^J,*) 

for  all  t,  j  such  that  L]^J  =  0^2  =  «!• 

•  Algorithm  A(-,2,4).  Transpose  D,  perform  a  multiple  source  broadcasting  of  the  elements 
of  within  processor  columns,  accumulate  the  partial  inner  products  for  elements  of 
A  by  multiple  sink  reduction  along  processor  rows  such  that  the  elements  of  at  most 

columns  are  accumulated  within  a  processor  column.  After  the  transposition  and 
broadcasting  processor  oi ,  02  has  the  elements  C(  » If^J )-^(  ^ 

that  I  1  ~  3Jid  j  such  that  J  =  0:2* 

Algorithm  v4(*,2,5).  Transpose  D,  perform  a  multiple  source  broadcasting  of  the  elements 
of  C  within  processor  columns,  accumulate  inner  products  for  elements  of  A  by  multiple 
sink  reduction  along  processor  rows,  such  that  each  processor  receives  elements  of  A^ 
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>1(-,2,1): 

n  ^  ~  brd.  D, 

mpy.,  [PR] 

^ma2 

>1(.,2,2): 

n  r-)  brd.  C,  4-* 

mpy.,  IQR]  red.  A.  t  • 

^•Ql  »  •^0fia2 

>1(.,2,3): 

T)  txp»  C»  /f  ^  j  brd.  D,  ^ 

> ‘^OriQr2  ^ 020^1  J  ^ OilQ2 

mpy.,  (PQ]  red.  A,  I  txp.  A,  ^  . 

. . — f''  ^OfiOrj 

^Oi2Cti  > 

>1(.,2,4): 

D  txp.  D,  /  ^  ^  brd.  D, 

mpy.,  (PQ)  red.  A,  ♦*  , 

^QlQ2  ’  ^02* 

A{;2,5): 

n  n  .!S?:  n  n  I. 

mpy.,  (QR)  red.  A,  **  ,  txp.  A.  /  , 

*  ^*ai  ^  ^0f2Ori  .  ^01102 

Cma2  ’  ^^2^1 

Figure  2:  Notation  summary  of  the  algorithms  for  two-dimensional  partitioning. 

for  each  of  columns  of  D.  Processor  01,02  computes  C(*,  L )-^( » Lf^J) 
for  all  *  such  that  Ly^J  —  J  =02- 

Figure  2  characterizes  the  5  algorithms.  The  two  subscripts  in  sequence  are  used  to  denote 
the  ordinal  numbers  of  block  rows  and  block  columns  of  the  Ni  x  JVj  blocks.  The  “k”  sign  means 
union  of  all  the  block  rows  (or  columns).  The  superscript  denotes  the  ordinal  number  of  the 
partial  inner  product  result.  The  number  in  the  square  brackets  (eg.  [PR]  in  .4(-,2,l))  is  the 
^  ihinimum  maximum  number  of  processors  to  minimize  the  arithmetic  time  for  each  algorithm. 
Algorithm  -4(',2,2)  has  a  matrix  transpose  in  addition  to  the  communication  of  C  as  in  algorithm 
*4(*,2,1).  But,  unlike  in  the  one-dimensional  case  algorithm  A(',2,2)  may  have  a  higher  processor 
utilization  than  algorithm  ^(*,2,1). 

The  broadcasting  in  >1(1, 2,1)  can  be  realized  by  a  rotation  algorithm,  which  yields  Cannon’s 
algorithm  [l].  Unlike  the  one-dimensional  case,  an  initial  alignment  is  required  in  order  to 
synchronize  between  the  rotations  of  C  and  jD.  For  Ni  =  N2  =  VW,  the  code  is  shown  below. 
For  Ni  ^  N2,  say  Ni  >  N2,  we  further  partition  the  submatrix  C  in  each  processor  into 

blocks  and  simulate  the  algorithm  for  Ni  x  Ni  blocks.  Each  processor  simulates  ^  processors. 
The  code  is  included  in  appendix  B. 

/*  Cannon’s  Algorithm  A(l,2,l):  */ 

/*  Assume  iVi  =  iVj  =  VN,  Gray  code  enc.  */ 

lc{ai,  02,  i’,j\  t)  over  [0  :  VF)  x  [0  :  \/iV)  x  [0  :  ^)  x  [0  :  ^)  x  [0  :  y/N)  = 

<  i  =  0  ^  c{oi:^  +  i', 02:^  +  3% 

else  -*•  /c(q:i,  (0:2  +  1)  mod  y/N,  1)  >, 

ld{ai,02,j\k’,t)  over  [0  :  y/1!)  x  [0  :  y/N)  x  (0  :  ^)  x  [0  :  ^)  x  [0  :  v^)  = 

<  t  =  0  d{ai-^  -H 3',  +  *')» 
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else  -►  W((qi  +  1)  mod  Vn,  02,  f,  k',t  -  1)  >, 

/a(di,  0-2,  over  (0  :  VW)  x  [0  :  y/W)  x  [0  :  x  [0  :  x  [0  :  y/N]  = 

<<  =  0-^0, 

else  /a(di,  d2,  t',  k*,t  -  1)  +  (\+  [/c(di,  d2,  i\f,  <  -  1) 

♦  ld(au&2,f,k\t  ~  1)|0  <  f  <  ^])  >, 
a(t,  k)  over  [0  :  P)  x  [0  :  i?)  =  lajred(  \}^\ ,  ,  t  mod  ^ ,  jfc  mod 


It  is  also  possible  to  design  a  matrix  multiplication  algorithm  based  on  the  SBT,  or  the 
nRSBT  communication  algorithms.  For  Algorithm  the  temporary  storage  for  each 

processor  becomes  ^  for  C  and  ^  for  D,  instead  of  ^  and  ^  for  Cannon’s  or  Dekel’s  algo¬ 
rithms.  However,  the  number  of  start-ups  is  reduced  to  0(ni  +  n2),  instead  of  0{Nx-\-N2).  Note 
that  the  initial  alignment  steps  can  be  eliminated.  It  is  possible  to  interleave  the  communication 
and  multiplication  steps  to  save  half  of  the  storage,.  However,  an  initial  alignment  is  required 
for  such  an  algorithm. 

The  initial  allocations  of  C  and  jD,  and  final  allocation  of  A  for  the  five  algorithms  below  are 
the  same,  and  is  described  once  and  for  all.  For  Algorithms  A(-,2,2)  and  .4(*,2,4),  la  is  replaced 
by  lajred. 

/*  Initial  allocations  of  C  and  D.  */ 

lc{aua2,t',f)  over  [0  :  Ni)  x  (0  :  N2)  x  [0  :  ^)  x  [0  :  3^)  =  ciai-^+  i',a2^+j% 
ld(ai,a2,f,k')  over  [0  :  IVi)  x  [0  :  ATj)  x  [0  :  ^)  X  [0  :  ^)  =  d{ai^  +  j\a2^  +  k% 

/*  Final- allocation  of  A.  */ 

a{i,  k)  over  [0  :  P)  x  [0  :  P)  =  /a(  [i^J ,  J ,  i  mod  k  mod  ■^) 


/*  A  Doubling  Algorithm  A(-,2,l):  */ 

lcjrow{ai,a2,i\j)  over  [0  :  Ni)  x  [0  :  N2)  x  (0  :  ^)  X  [0  :  (?)  =  /c(oi,  mod  ^), 

ld-col{aua2,j,k*)  over  [0  :  Ni)  x  [0  :  N2)  x  [0  :  (?)  x  (0  :  ]^)  =  /d([^J,02,i  mod 
/a(Qi, 02,  i',  k')  over  [0  :  ATj)  x  [0  ;  ATj)  x  [0  :  X  [0  ;  ^)  = 

\+  [/c_rou;(oi,02,t',j)  *  ldjcol{ai,a2,j,k')\Q  <j<Q] 


/*  Algorithm  A(-,2,2);  */ 

/cJxp(oi,O2,t',j0  over  [0  :  Ni)  x  [0  :  N2)  x  [0  :  3^)  X  [0  :  ^)  = 

ML(<J'il^  +  *0/^J»[(a2^  +  /)/^J>(ai]^  +  *')mod  ;^,(o23^-|- iOmod  ^), 
lcJxpjrow(ai,a2,  t,i')  over  [0  :  Ni)  x  [0  :  N2)  X  [0  :  P)  x  [0  :  = 

lcJxp{ai »  L^J ,  i  mod  j'), 

/a(Qi,02,i,  k*)  over  [0  :  Ni)  x  [0  :  N2)  X  [0  :  P)  x  [0  :  ss 

\+  [lcJxpjrow(ai,  02,  ij')  *  ld{ai,  02, j',  k')\0  <  f  <  ^], 
lajred{a-i,Q2,i\k')  over  [0  :  N{)  x  (0  :  N2)  x  [0  :  ^)  x  [0  :  = 

\+  +  i',fc')|0  <  oj  <  Ni] 
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Algorithm 

Number  of  arithmetic  operations 

>^"(•,2,1) 

>^"(-.2,2) 

(2r*i  +  r*ir*i 

-47., 2, 3) 

(2r*i  -  Dfifc!  + Es.  m  rii  +  fi&i  rie-i 

A7.,2,4) 

_(2ri^i  -  i)r7^i^+Er=if7^i  r^i  + 

^“(.,2,5) 

Table  7:  The  communicatioa  complexity  for  optimum  buffer  sizes,  two-dimensional  partitioning, 
and  one-port  communication. 

/*  Algorithm  A(-,2,3);  */ 

fcJxp(ai,Q2,  *■',/)  over  [0  :  J\^i)  x  [0  :  N2)  x  [0  :  3^)  x  [0  :  = 

+  +  0  mod  ^,(03^  + /)  mod  ^), 

ldjrow{ai,a2,f,k)  over  [0  :  J\^i)  x  [0  :  N2)  x  [0  :  ^)  X  [0  :  i2)  = 

/d(ai,  L^J,/,A  mod  ^), 

laJxp(ai,a2,  i',k)  over  [0  :  Ni)  x  [0  :  N2)  x  [0  :  x  [0  :  .R)  = 

\+  [ic-ixp(ai,a2,i',j')  *  ldjrow{Qi,a2,j\k)\0  <j< 
laJxpjred(ai,a2, i\  k')  over  [0  :  Ni)  x  [0  :  ^^2)  X  [0  :  ■^)  x  [0  :  = 

la(ai,a2,  i', k')  over  [0  :  Ni)  x  [0  :  A^2)  x  [0  :  ^)  x  [0  :  5^)  = 

laJxp.redi[ai-^  +  1.02]^  +  +  V)  mod  (qj^  -f-  Id)  mod 


Algorithms  A(-,2,4)  and  A(-,2,5)  are  included  in  appendix  B. 

Table  7  shows  the  total  number  of  arithmetic  operations  in  sequence.  Note  that  if  P,  Q 
and  R  are  multiples  of  Ni  and  N2,  then  the  arithmetic  complexities  of  the  algorithms  are  the 
same,  and  indeed  the  same  as  for  a  one-dimensional  partitioning.  Table  8  shows  the  total 
number  of  elements  transferred  in  sequence  and  the  minimum  number  of  start-ups  with  one- 
port  communication.  By  using  some  approximations,  the  values  of  Ni  and  N2  that  minimize  the 
number  of  elements  transferred  for  different  algorithms  are  shown  in  Table  9.  The  resulting  total 
complexities  are  shown  in  Table  10.  By  considering  the  highest-order  term,  the  data  transfer 
times  compare  as  Q  :  P  :  R  :  R  :  P  from  A(l,2,l)  to  A(l,2,5).  It  can  be  shown  [7]  that  for  P, 
Q  and  R  being  multiples  of  IVi  and  N2,  the  complexities  of  algorithms  A(',2,3)  and  A(*,2,5)  are 
always  higher  than  that  of  min(  A(-,2,2),  -4(-,2,4)),  if  the  optimum  values  of  Ni  and  N2  are  chosen 
for  each,  algorithm.  Table  11  shows  the  communication  complexity  with  n-port  communication 
and  optimum  packet  size.  For  a  detailed  analysis  and  optimum  choice  of  algorithms,  see  [7]. 
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Algorithm 

Element  transfers 

min  start-ups 

.4^(1, 2,1) 

+w-i)r4ir*i 

n 

.4'(1,2,2) 

+ri&ir4i(^i-i) 

2n 

>(1,2,3) 

+r#ir4iw-i)+mr;&in 

Sti 

^^1,2,4) 

f4irwi"+wr4iw-i) 

+r4ir4i(^2-i) 

2n 

i 

>i"(l,2,5) 

+  Ticrl  - 1) 

+r77rl  r777l(-^2  - 1)  + 

3n 

Table  8:  The  communication  complexity  using  two-dimensional  partitioning. 


Algorithm 

Nx 

N2 

.4'(1,2,1) 

fRN 

y-p- 

>1"(1,2,2) 

vf 

fW 

y~T 

>t"(l,2,3) 

>l^(l,2,4) 

fPN 

\J-!r 

>l'(l,2,5) 

[rn 

~w 

Table  9:  The  optimum  values  of  Ni  and  N2  for  P,  Q  and  ii  being  multiples  of  N  and  one-port 
communication. 


Algorithm 

Tmin 

44i<.  +  4r(2VPie-^)<.  +  nr - 

A‘(1,2,2) 

+ ^(2\m + +i^;~ 

>1^(1,2,3) 

+  :^(2,/P5+  +  3nr 

>l'(l,2,4) 

+  :^(2v^+  +  2„r 

>^*(1,2,5) 

2^1.  +  :^(2V^  +  ^ 

Table  10:  The  total  complexity  with  optimum  values  of  Nx  and  N2  for  P,  Q  and  P  being 
multiples  of  N  and  one-port  communication. 
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Algorithm 

min  communication  time 

^‘=(n,2,2) 

+  \/(n  -  l)r)-' 

•4''(n,2,4) 

A%n,2,5) 

+  (\/r^l  r^l<c  +  \/(n  “  l)r)‘‘ 

+(\/r*i 

+(r7^ir'^l“ir“  +  f/Trl  F77rl“^)*c 

Table  11:  The  communication  complexity  for  optimum 
ing,  and  n-port  communication. 


uuiici  twu-uimensionai  parti  iion- 


3.3  Three-dimensional  partitioning 

In  the  case  of  a  three  dimensional  partitioning  of  the  Boolean  cube  each  x  subset  of 
processors  compute  the  product  of  a  P  x  ^  matrix  and  a  ^  x  P  matrix.  If  the  matrices 

are  initially  allocated  such  that  there  are  distinct  submatrices  P  y.  §r  and  ^  x  P  assigned 
to  each  set  of  ^  processors  then  the  multiplication  in  each  subset  is  the  same  as  in  the  two 
dimensional  partitioning,  except  that  Q  is  replaced  by  In  addition,  there  is  an  accumulation 
phase  at  the  end.  The  number  of  arithmetic  operations  for  this  part  of  the  computation  is 
r^l  log -^3  without  any  pipelining,  and  all  partial  products  being  accumulated  in  the  same 

way.  The  matrix  A  is  allocated  among  ^  processors.  If  there  are  several  elements  of  the  matrix 
A  that  are  stored  in  the  same  processor,  then  the  accumulation  can  be  made  faster  by  usine 

all-to-all  reduction.  The  arithmetic  complexity  becomes  Yltli  [— l^a-  The  communication 


complexity  for  the  reduction  is  Erli  1  +  ESi  I '  1  r.  When  >  N^,  it 

is  an  olUto^all  reduction^  and  the  communication  complexity  of  the  reduction  is  approximately 

~  1^-  For  detailed  complexity  analysis,  see  [7].  The  code  is 

given  below. 
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/*  Algorithm  A(-,3,l):  */ 

/*  Matrix  C  is  partitioned  as  N{x  */ 

/c3(ai,Qf2,a3,*',i'0  over  [0 :  N{)  x  [0  :  N^)  x  [0  :  Ni)  x  [0  :  ^)  x  [0  :  j^)  = 
c^oti-gr  +  i',  a3-§T  +  ^23^  + 

/*  Matrix  D  is  partitioned  as  N{Nq  X  N^.  */ 

ldZ{ofi,a2,ocz,j\k")  over  [0  :  JV')  x  [0  :  N^)  x  [0  :  N^)  x  [0  :  j^)  x  [0  :  ^)  = 

d{az§T  +  ai  +  i",  £*2^  +  *')» 

/*  Broadcast  C  along  JVj  direction.  */ 

lcZjrow{au  ^2,  <*3,  iO  over  (0  :  N{)  x  [0  :  N^)  x  [0  :  N^)  x  [0  :  ^)  x  [0  :  ^)  = 
/c3(qi,  ti^J,03,i',j'inod  3^), 

/*  Broadcast  D  along  N{  direction.  */ 

ldZxol(ai,  Q2, 03,  j',  k')  over  [0 :  JV')  x  [0  :  N^)  x  [0 :  N^)  x  [0 :  ^)  x  (0  :  ^)  = 

W3(L^^J,Q2,03,j'mod  3^,&0, 

/*  Compute  partial  inner  product  locally.  */ 

Ia2(ai,a2,az,i\k')  over  [0  :  x  [0  :  N2)  X  [0  :  N^)  x  [0  :  ^)  X  [0  :  ^)  = 

\+  [lcZjrov}{ai,a2,Qz,i\f)  *  ldZxol{ai,a2, 03>  j'»*0|0  <  f  <  §t],  * 

/*  Reduction  along  JV3  direction.  */ 

/c3j-ed(ai,Q2,a3,t',fc')  over  [0  :  iV')  x  [0  ;  N^)  x  [0  :  N^)  x  [0  :  3^)  X  [0  :  3^)  = 
\+  [/a3(Qi, 02, 03,  L~7^J 3^  +  *'> (03  mod  3^)3^  +  &')|0  03  <  J\r|], 

/*  Relabeling  processor  indices  as  two-dimensional.  */ 

/o2(ai,  Q2,  k')  over  [0  :  Nx)  x  [0  :  N2)  x  [0  :  3^)  X  [0  :  3^)  = 

lo3j-e<i([2^J,  mod  +  o;2  mod  ^,i\k*), 

I*  Resulting  matrix  A  is  partitioned  as  Nx  x  N2-  */ 

c(i,  k)  over  (0  :  P)  X  (0  ;  R)  =  /o2(  ,  L^J ,  i  mod  3^,  fc  mod  3^) 


Note  that  in  the  above  algorithm,  the  matrix  A  is  partitioned  into  Nx  x  N2  blocks  with  no 
extra  communication  after  the  reduction  step.  Depending  on  how  the  data  set  is  divided  during 
the  reduction  steps,  the  resulting  matrix  A  can  be  partitioned  into  N[  x  N^N^  blocks,  N'x  N^  x  N^ 
blocks,  or  some  blocking  scheme  in-between  those  two. 


If  the  matrices  C  and  D  initially  are  partitioned  into  Nx  x  N2  blocks,  then  transformations 
are  required  to  change  the  allocation  into  N'x  x  N^N^  blocks,  and  NiN^  x  N^  blocks,  respectively. 
The  transformation  can  be  specified  as  follows. 

/c3(ai,Q2,a3,i',/0  over  [0  :  N{)  x  [0  :  N^)  x  [0  :  N^)  x  [0  :  ^)  X  [0  :  3^)  = 

/c2(ai^  +  L^J»«3^  +  L^^J,i'mod  3^, (02  mod  + 

ldZ{ax,a2,az,j",k')  over  [0  :  N{)  x  [0  :  N^)  x  [0  :  JV')  x  [0  :  3^)  x  [0  :  ^)  = 

ld2{az^  +  +  L^J>(ai  mod  ^^)j^  +  f',k'  mod  3^) 
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4  Conclusion 


We  have  shown  how  algorithms  for  distributed  architectures,  such  as  a  Boolean  cube,  can  be 
expressed  in  terms  of  a  shared  global  address  space,  and  how  the  translation  between  local  and 
global  addresses  can  be  carried  out.  We  have  also  shown  how  the  network  and  low  level  com¬ 
munication  features  of  the  architecture  can  be  encapsulated  into  generic  global  communication 
primitives,  such  as  all-to-all  broadcasting  within  a  (sub)cube,  all-to-all  reduction,  matrix  trans¬ 
position  (dimension  permutation).  These  primitives  can  either  be  integrated  into  compilers,  or 
incorporated  into  the  communication  system  by  providing  dilFerent  communication  modes.  The 
communications  would  be  transparent  to  the  user.  The  architectural  dependence  is  hidden  in 
the  communication  primitives.  The  algorithms  for  matrix  multiplication  that  we  have  used  for 
illustration  cover  algorithms  that  parallelize  one,  two,  or  all  three  loops  of  a  matrix  multipli¬ 
cation,  and  for  each  degree  of  parallelization  algorithms  that  are  optimal  for  different  matrix 
shapes  and  architectural  parameters. 


Appendix 


A  Communication  primitives 

A.l  One-dimensional  partitioning 

/*  An  nRSBT  transpose  algorithm  (column  part.,  n-port).  */ 
fcJxpl(a, u, i',f, t)  over  [0  :  IV)  x  [0  :  n)  x  [0  :  ^)  =  x[0  :  2*^)  x  [0  :  n] 
<  t  =  0  —  lc(a,  i\  u:^  +  f), 
mod  2  =  0-* 

<  0  <  i  <  lcJxpl{a,u,i',f,t-  1) 

else  -*  lcJxpl{a  ®  u,  i',f  -  2*-^  -  1)  >, 

else  -* 

<  0  <  i  <  2*-'^:^  -*  lcJxpl(a  ©  2(“-*)n«><in,  u,  i'  +  1) 

else  ^  lcJxpl{a,  u,  i'  H-  ^,j'  -  2*-’^-^,  t  -  1)  », 

lcJxp(a,  V,j)  oyer  [0  :  JV)  x  [0  :  ^)  x  [0  :  (?)  = 

lcJxpl(a,  mod  n,  i\  mod  :^,n) 


/*  nRSBT  reduction.  */ 

/*  Between  columns,  n-port,  binary  encoding.  */ 
lajredl{a,  u,  Id,  t)  over  [0  :  iV)  x  [0  :  n)  x  [0  :  f )  x  [0  :  ^)  X  [0  :  n]  = 

<  t  =s  0  la{ot,  sh{u,  L^J)^  +  k'  mod  ^), 

L2r»-«?n.ednJ  mod  2  =  0^  /c j*edl(o,  u,  i',  k’,  t  -  1)  -f  lajredl{a  ©  u,  i',  k',  t-\), 

else  lajredl{oL,  u,  i',  t  _  i)  +  lajredl{a  ©  u,  i',  F  -f-  t  -  1)  >, 
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la.red(a,  i,  *0  over  [0  :  jV)  x  [0  :  P)  x  [0  :  j^)  =  lajredl(a,  ,  i  mod  f ,  k',  n) 


A. 2  Two-dimensional  partitioning 

/*  SBT  broadcasting  (row  direction,  one-port).  */ 

/cj-oti7l(Qi,a2,:',/,t)  over  [0  :  Ni)  x  [0  :  N2)  x  [0  :  3^)  x  [0  : 2‘^)  x  [0  :  nj]  = 

<t  =  0-./c(oi,a2,t',A 

else  <  0  <  /  <  2*”^^  -*■  lcjroivl(ai,a2,i',j\t  -  1), 

else  -+  lcj'owl{ai,Q2  ©  2‘~^i^/-  -  1)  », 

lcjrow(au  02,  j)  over  [0  :  Ni)  x  [0  :  N2)  x  [0  :  ^)  X  [0  :  Q)  =  lcjrowl(ai, Q2,  i'J  ©  ^2]^,  n2) 


/*  nRSBT  broadcasting  (row  direction,  n-port).  */ 

/cj-ot£;l(Q;i,a2,  w,i',/,t)  over  [0  :  Ni)  x  [0  :  N2)  X  [0  : 712)  x  [0  :  x  [0  :  2*-§-)  x  [0  : 712)  = 

<  t  =  0  /c(ai,  02,  +  i'J'),  * 

else  -+  <  0  <  j'  <  2*-'^  /cj'0t£;l(oi,O2,«,i',/,t  -  1), 

else  -*•  Icjr-owl^Qi, 02  ©  , u, i*,j'  -  2*“^ », 

Icjrow^ai,  02,  u,  i',j)  over  [0  :  Ni)  x  [0  :  N2)  X  [0  ;  712)  x  [0  :  x  [0  :  Q)  = 

i'  mod  (sh(u,  if^J)  ®  a,)^, o,) 


/*  SBT  broadcasting  (column  direction,  one-port).  */ 

ldu:oll(ai,  02,  f,  k',t)  over  [0  ;  Ni)  x  [0  :  N2)  X  [0  :  2*^)  x  [0  :  #)  x  [0  :  th]  = 

<t  =  0-/d(oi,O2,/,n 

else  ^  <  0  <  /  <  ^  /d-co/l(oi,02,i',fe',t-  1), 

else  ldxoll{ai  ©  2*“'',02,j'  -  -  1)  », 

W^o/(oi,O2,i,/:0  over  [0  :  Ni)  x  [0  :  N2)  x  [0  :  Q)  x  [0  :  ^)  =  /d^o/l(oi,02,i  ©  oi^,fc',77i) 


/*  nRSBT  broadcasting  (column  direction,  n-port).  */ 

/d^o/l(oi,02,tt,/,^*',t)  over  [0  :  J\ri)  x  (0  :  N2)  x  [0  :  Tii)  x  [0  :  2*§-)  x  [0  :  ;^)  x  [0  :  tii]  = 
<  t  =  0  -♦  /d(oi,02,i',u^j^  +  fc'), 

else  <  0  <  /  <  2‘-i^  ^  W^o/l(oi,02, -  1), 

else  /d^o/l(oi  ©  2(“+*-i)«‘o<*’»» ,  02, «,/  -  2<-i  3^,  fc',  t  -  1)  », 

/d^o/(oi,  02,  k)  over  (0  :  N-y)  x  [0  ;  7^2)  x  [0  :  tii)  x  [0  :  Q)  x  [0  :  ;^)  = 

W^oll(oi, 02,  mod  (s/i(u,  )  ©  Oi)^,  Til) 
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B  Matrix  Multiplication 

B.l  One-dimensional  partitioning 

I*  A  Gray  code  Exchange  alg.  A(l,l,l).  */ 

/*  Column  partitioning,  Gray  code  encoding.  * / 
lc(a,  i,f,  t)  over  (0  :  iV)  x  [0  :  P)  x  [0  :  ^)  x  (0  :  i\r)  = 

<t  =  0-.c(z,Q^  +  /), 

else  lc(G(G-\a)  0  *,/,  t  -  1)  >, 

/d(d,i,  *0  over  [0  :  A)  x  [0  :  Q)  x  [0  :  ^)  =  d(j,a§  +  k% 
la{&,  t,  k',  t)  over  [0  :  AT)  x  [0  :  P)  x  [0  :  X  [0  :  iV]  = 

<  t  s  0  0, 

else  la(&,  i,  k',  f  -  l)  +  (\+  [/c(q,  i,  j\  t-1) 

*  ld{&,  (d  0  (t  -  1))S  +  f,  k')\0  <  f  <  ^])  », 
a(t,  k)  over  [0  :  P)  x  [0  :  P)  =  la(  ,  i,  k  mod  N) 


B.2  Two-dimensional  partitioning 


The  index  /  in  the  following  code  denotes  the  rank  of  the  blocks  within  each  processor. 

The  nuniber  of  the  communication  steps  after  the  initial  alignment  is  2max(Ari,  N2)  -  2  in  the 
code.  It  is  possible  to  reduce  it  to  iVi  +  ^^2  ~  2  by  a  more  complicated  code. 

/*  Cannon’s  Algorithm  A(l, 2,1);  ♦/ 

/*  Nmax  =  max(Ni,N2)  and  Nmin  =  Tain(NuN2).  */ 

(c(a,, i',/,  <)  over  (0  :  J\'.)  X  [0  :  A',)  X  [0  :  fcj)  X  [0  :  X  (0 ;  ^)  X  [0  :  AT™.,]  = 

<  t  =  0  -*•  c(qi^  +  i',Q2^  +  ^7^  +  /). 

/*  Initial  alignment.  */ 

♦  —  1  ;  /  I  ((“2^+/+oti)modAri)JV2 

*  -  1  L  JV, - J .  (^  +  Q:i)  mod  i'J',  0), 

/*  The  last  block  gets  from  next  proc.  */ 

^  ~  ^  ^  (“2  +  1)  mod  N2, 0,  i', /,  t  -  1), 

/*  Other  blocks  get  from  right  locally.  */ 
else  ^  /c(ai,  03, 1  +  1,  t',/,  t  -  1)  >, 
else  —*• 

<  t  =  0  c(ai3^  +  +  *'><22^+/), 

t  =  1—*  lc(ai,{ai^  +  I  +  02)  mod  N2,l,i*,j',0), 
else  /c(qi,  (03  +  1)  mod  ATj,  /,  i\j\  t  -  1)  », 

‘ X [" •  fe) X 1° •  7&r) X [0 :  sb) X (0 : Jv„„i  = 

<  t  =  0  d(a,^  +  +  *'), 
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^  =  fc  -  1  +  1)  Nl, 0^2,0,  f,k\t  -  1), 

else  -»W(ai, 02,/  +  !, /,*',<-!)>, 
else  -♦ 

<  t  =  0 <i(ai^+ /,  02]^  + /^  +  fc'), 

t  =  1  -►  W((a2^  +  /  +  o:i)  mod  JVi,a2j,i',fc',0), 
else  -»  W((ai  +  1)  mod  Ni,a2,l,f,k',t  -  1)  », 
la(ai,a2,l,i\k',t)  over  [0  :  Ni)  x  [0  :  N2)  k  [0  :  x  [0  :  x  [0  :  jj^)  x  [0  : 

t  s=  0  — 0, 

else  -♦  /a(Qi,  ©2, 1,  k',  t  -  1)  +  (\+  [/c(ai,  02,  /,  i\j\  t  -  1) 

♦  W(ai,  02, -  1)|0  <  /  <  >, 

a(i,  fc)  over  [0 :  P)  x  [0  :  5)  =  ”*** 

<Ni>N2-^  /a(L^J,  mod  mod  ■g^,k  mod 

else  -♦  la{  mod  i  mod  k  mod  JV2)  > 


/*  Algorithm  A(*,2,4);  */ 

ldJxp(ai,a2,f,k')  over  [0  :  JVi)  x  [0  :  N2)  x  [0  :  ^)  x  [0  :  ^)  = 

+  /)/ 1(02^  +  fcO/^J>(ori^  +/)  mod  3^,  (0:23^  +  *0  mod  3^), 
Id Jxpjrov}(ai,  02,  ij')  over  [0  :  x  [0  :  N2)  x  [0  :  ^)  X  (0  :  P)  = 

/dJaj)(Qi,  [4^J,/,Ar  mod  3^), 

/a(ai, 02, *■', *)  over  [0  :  JNTi)  x  [0  :  jV2)  x  [0  :  3^)  x  [0  :  P)  = 

\+  Pc(«i.  02,  i'J')  *  ldJxp.col(ai,  02,  f,  fc)|0  <  f  <  ^], 
la.red{oi,02,i',k')  over  [0  :  Ni)  x  [0  :  N2)  x  (0  :  3^)  X  [0*  #)  = 

\+  [lfl(0:i,O2,  *',023^  +  ^  O2  <  JV2] 


/*  Algorithm  A(-,2,5):  */ 

ldJ.xp{oi,02,j',k')  over  [0  :  Ni)  x  [0  ;  N2)  x  (0  :  ^)  x  [0  :  3^)  = 

+/)/&,  +  ‘')/^J, 

(“1  ^  + /)  mod  (aj)^  +  i')  mod 

lc-col(oi,02,i,j')  over  (0  :  Ni)  x  [0  :  N2)  x[0:P)x[0:  S-)  = 

K  L^J .  02,  *  mod  3^,  f), 

laJxp{oi,02,i,k')  over  [0  :  Ni)  x  [0  :  P’2)  x  [0  :  P)  x  [0  :  3§-)  = 

\+  [lcj:ol{oi,02,i,j')  *  Id  Jxp{oi,  02,  j',k')\0  <j<  ^]l 
laJxpjred(oi,02,i\k')  over  [0  :  Pi)  x  [0  :  P2)  x  [0  :  ^)  x  [0  :  ^)  = 

\+  [la Jxp{oi, 0*2, 023^  +  &0|0  <  0!2  <  P2], 

laiai,Q2,V,k')  over  [0  :  Pi)  x  [0  :  P2)  x  [0  :  3^)  x  [0  :  3^)  = 

laJxp.redi[oi-^  +  [023^  +  *73^/,  (013^  +  V)  mod  3^,  (02^  +  k')  mod  3^) 
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